Moving object detection from multi-depth images with an attention-enhanced CNN
Shibukawa, Masato, Yoshida, Fumi, Yanagisawa, Toshifumi, Ito, Takashi, Kurosaki, Hirohisa, Yoshikawa, Makoto, Kamiya, Kohki, Jiang, Ji-an, Fraser, Wesley, Kavelaars, JJ, Benecchi, Susan, Verbiscer, Anne, Hatakeyama, Akira, O, Hosei, Ozaki, Naoya
One of the greatest challenges in detecting moving objects in the solar system from wide-field survey data is determining whether a signal indicates a true object or arises from some other source, such as noise. Object verification has relied heavily on human eyes, which usually incurs significant labor costs. To address this limitation and reduce the reliance on manual intervention, we propose a multi-input convolutional neural network integrated with a convolutional block attention module. This method is specifically tailored to enhance the moving object detection system that we have developed and used previously. The current method introduces two innovations. The first is a multi-input architecture that processes multiple stacked images simultaneously. The second is the incorporation of the convolutional block attention module, which enables the model to focus on essential features in both the spatial and channel dimensions. These advancements facilitate efficient learning from multiple inputs, leading to more robust detection of moving objects. The performance of the model is evaluated on a dataset consisting of approximately 2,000 observational images. We achieved an accuracy of nearly 99% with an AUC (Area Under the Curve) greater than 0.99. These metrics indicate that the proposed model achieves excellent classification performance. By adjusting the threshold for object detection, the new model reduces the human workload by more than 99% compared to manual verification.
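The abstract includes no code, but the attention mechanism it names is the standard convolutional block attention module (CBAM) of Woo et al. (2018). Below is a minimal PyTorch sketch of CBAM together with a hypothetical multi-input head of the kind the abstract describes; the layer sizes, single-channel inputs, and two-class output are illustrative assumptions, not the authors' architecture.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: shared MLP over global avg- and max-pooled features."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        avg = self.mlp(x.mean(dim=(2, 3)))   # global average pooling
        mx = self.mlp(x.amax(dim=(2, 3)))    # global max pooling
        return torch.sigmoid(avg + mx).view(b, c, 1, 1)

class SpatialAttention(nn.Module):
    """Spatial attention: 7x7 conv over channel-wise avg and max maps."""
    def __init__(self, kernel_size=7):
        super().__init__()
        self.conv = nn.Conv2d(2, 1, kernel_size, padding=kernel_size // 2)

    def forward(self, x):
        avg = x.mean(dim=1, keepdim=True)
        mx = x.amax(dim=1, keepdim=True)
        return torch.sigmoid(self.conv(torch.cat([avg, mx], dim=1)))

class CBAM(nn.Module):
    """Applies channel attention, then spatial attention, to a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.ca = ChannelAttention(channels)
        self.sa = SpatialAttention()

    def forward(self, x):
        x = x * self.ca(x)
        return x * self.sa(x)

class MultiInputClassifier(nn.Module):
    """Hypothetical multi-input design: each stacked image passes through a
    shared conv stem with CBAM; branch features are fused for classification
    (real object vs. artifact). Details are assumptions for illustration."""
    def __init__(self, n_inputs=3):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(inplace=True),
            CBAM(32), nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32 * n_inputs, 2)

    def forward(self, imgs):  # imgs: list of (B, 1, H, W) tensors
        feats = [self.stem(im).flatten(1) for im in imgs]
        return self.head(torch.cat(feats, dim=1))
```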
Using Knowledge Graphs to harvest datasets for efficient CLIP model training
Ging, Simon, Walter, Sebastian, Bratulić, Jelena, Dienert, Johannes, Bast, Hannah, Brox, Thomas
Training high-quality CLIP models typically requires enormous datasets, which limits the development of domain-specific models -- especially in areas that even the largest CLIP models do not cover well -- and drives up training costs. This poses challenges for scientific research that needs fine-grained control over the training procedure of CLIP models. In this work, we show that by employing smart web search strategies enhanced with knowledge graphs, a robust CLIP model can be trained from scratch with considerably less data. Specifically, we demonstrate that an expert foundation model for living organisms can be built using just 10M images. Moreover, we introduce EntityNet, a dataset comprising 33M images paired with 46M text descriptions, which enables the training of a generic CLIP model in significantly reduced time.
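The harvesting pipeline is the paper's contribution and is not reproduced here. For orientation, the sketch below shows the symmetric contrastive (InfoNCE) objective that CLIP-style training optimizes over image-text pairs, whichever dataset they come from; the variable names and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss used to train CLIP-style models.

    image_emb, text_emb: (N, D) L2-normalized embeddings of matched pairs;
    the i-th image and i-th text are positives, all others negatives.
    """
    logits = image_emb @ text_emb.t() / temperature      # (N, N) similarities
    targets = torch.arange(len(logits), device=logits.device)
    loss_i = F.cross_entropy(logits, targets)            # image -> matching text
    loss_t = F.cross_entropy(logits.t(), targets)        # text -> matching image
    return (loss_i + loss_t) / 2
```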
A vibe coding learning design to enhance EFL students' talking to, through, and about AI
Woo, David James, Guo, Kai, Yu, Yangyang
This innovative practice article reports on the piloting of vibe coding (using natural language to create software applications with AI) for English as a Foreign Language (EFL) education. We developed a human-AI meta-languaging framework with three dimensions: talking to AI (prompt engineering), talking through AI (negotiating authorship), and talking about AI (mental models of AI). Using backward design principles, we created a four-hour workshop in which two students designed applications addressing authentic EFL writing challenges. We adopted a case study methodology, collecting data from worksheets, video recordings, think-aloud protocols, screen recordings, and AI-generated images. Contrasting cases showed one student successfully vibe coding a functional application that cohered to her intended design, while the other encountered technical difficulties, with major gaps between intended design and actual functionality. Analysis reveals differences in the students' prompt engineering approaches, suggesting different AI mental models and tensions in attributing authorship. We argue that AI functions as a beneficial languaging machine, and that differences in how students talk to, through, and about AI explain the variation in vibe coding outcomes. Findings indicate that effective vibe coding instruction requires explicit meta-languaging scaffolding: teaching structured prompt engineering, facilitating critical authorship discussions, and developing vocabulary for articulating AI mental models.
Not All Visitors are Bilingual: A Measurement Study of the Multilingual Web from an Accessibility Perspective
Bhuiyan, Masudul Hasan Masud, Varvello, Matteo, Zaki, Yasir, Staicu, Cristian-Alexandru
English is the predominant language on the web, powering nearly half of the world's top ten million websites. Support for multilingual content is nevertheless growing, with many websites increasingly combining English with regional or native languages in both visible content and hidden metadata. This multilingualism introduces significant barriers for users with visual impairments: assistive technologies like screen readers frequently lack robust support for non-Latin scripts and misrender or mispronounce non-English text, compounding accessibility challenges across diverse linguistic contexts. Yet large-scale studies of this issue have been limited by the lack of comprehensive datasets on multilingual web content. To address this gap, we introduce LangCrUX, the first large-scale dataset of 120,000 popular websites across 12 languages that primarily use non-Latin scripts. Leveraging this dataset, we conduct a systematic analysis of multilingual web accessibility and uncover widespread neglect of accessibility hints. We find that these hints often fail to reflect the language diversity of the visible content, reducing the effectiveness of screen readers and limiting web accessibility. Finally, we propose Kizuki, a language-aware automated accessibility testing extension that accounts for the limited utility of language-inconsistent accessibility hints.
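Kizuki's implementation is not described in the abstract; the sketch below only illustrates the general kind of language-consistency check such a tool performs. The Unicode script table and the flagging rule are crude assumptions for illustration, not the paper's method.

```python
import re
from bs4 import BeautifulSoup  # pip install beautifulsoup4

# Rough script ranges for a few non-Latin languages (illustrative only).
SCRIPT_RANGES = {
    "hi": r"[\u0900-\u097F]",                # Devanagari
    "ar": r"[\u0600-\u06FF]",                # Arabic
    "ja": r"[\u3040-\u30FF\u4E00-\u9FFF]",   # Kana + Kanji
}

def inconsistent_alt_texts(html, page_lang):
    """Flag images whose alt text does not use the page's declared script.

    Screen readers rely on such hints; alt text written in a different
    script than the declared page language is a likely accessibility defect.
    """
    soup = BeautifulSoup(html, "html.parser")
    pattern = SCRIPT_RANGES.get(page_lang)
    if pattern is None:
        return []
    flagged = []
    for img in soup.find_all("img"):
        alt = img.get("alt", "")
        # Non-empty alt text containing no characters of the expected script.
        if alt.strip() and not re.search(pattern, alt):
            flagged.append(alt)
    return flagged
```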
AccessGuru: Leveraging LLMs to Detect and Correct Web Accessibility Violations in HTML Code
Fathallah, Nadeen, Hernández, Daniel, Staab, Steffen
The vast majority of Web pages fail to comply with established Web accessibility guidelines, excluding a range of users with diverse abilities from interacting with their content. Making Web pages accessible to all users requires dedicated expertise and additional manual effort from Web page providers. To lower this effort and promote inclusiveness, we aim to automatically detect and correct Web accessibility violations in HTML code. While previous work has made progress in detecting certain types of accessibility violations, automatically detecting and correcting violations remains an open challenge, which we address here. We introduce a novel taxonomy classifying Web accessibility violations into three key categories: Syntactic, Semantic, and Layout. This taxonomy provides a structured foundation for developing our detection and correction method and for redefining evaluation metrics. We propose a novel method, AccessGuru, which combines existing accessibility testing tools and Large Language Models (LLMs) to detect violations, then applies taxonomy-driven prompting strategies to correct all three categories. To evaluate these capabilities, we develop a benchmark of real-world Web accessibility violations. Our benchmark quantifies syntactic and layout compliance and judges semantic accuracy through comparative analysis with human expert corrections. Evaluation against this benchmark shows that AccessGuru decreases violation scores by up to 84% on average, significantly outperforming prior methods, which achieve at most 50%.
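The abstract does not show the prompts themselves, so the following is a hypothetical sketch of what taxonomy-driven prompting could look like: one template per violation category, dispatched by the detected category. The templates and the `call_llm` placeholder are assumptions, not AccessGuru's actual interface.

```python
# Hypothetical taxonomy-driven prompting sketch; `call_llm` stands in for
# whichever chat-completion client is used and is not AccessGuru's API.
PROMPTS = {
    "syntactic": "Fix the accessibility violation in this HTML snippet by "
                 "correcting invalid or missing attributes. Return only HTML.\n{html}",
    "semantic":  "The following HTML has a semantic accessibility violation "
                 "({rule}). Rewrite it with meaningful text alternatives.\n{html}",
    "layout":    "Adjust this HTML/CSS so the layout meets WCAG contrast and "
                 "ordering requirements, changing as little as possible.\n{html}",
}

def correct_violation(category: str, rule: str, html: str, call_llm) -> str:
    """Select a category-specific prompt and ask an LLM for a corrected snippet."""
    prompt = PROMPTS[category].format(rule=rule, html=html)
    return call_llm(prompt)
```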
AltGen: AI-Driven Alt Text Generation for Enhancing EPUB Accessibility
Shen, Yixian, Zhang, Hang, Shen, Yanxin, Wang, Lun, Shi, Chuanqi, Du, Shaoshuai, Tao, Yiyi
Digital accessibility is a cornerstone of inclusive content delivery, yet many EPUB files fail to meet fundamental accessibility standards, particularly in providing descriptive alt text for images. Alt text plays a critical role in enabling visually impaired users to understand visual content through assistive technologies. However, generating high-quality alt text at scale is a resource-intensive process, creating significant challenges for organizations aiming to ensure accessibility compliance. This paper introduces AltGen, a novel AI-driven pipeline designed to automate the generation of alt text for images in EPUB files. By integrating state-of-the-art generative models, including advanced transformer-based architectures, AltGen achieves contextually relevant and linguistically coherent alt text descriptions. The pipeline encompasses multiple stages, starting with data preprocessing to extract and prepare relevant content, followed by visual analysis using computer vision models such as CLIP and ViT. The extracted visual features are enriched with contextual information from surrounding text, enabling the fine-tuned language models to generate descriptive and accurate alt text. Validation of the generated output employs both quantitative metrics, such as cosine similarity and BLEU scores, and qualitative feedback from visually impaired users. Experimental results demonstrate the efficacy of AltGen across diverse datasets, achieving a 97.5% reduction in accessibility errors and high scores in similarity and linguistic fidelity metrics. User studies highlight the practical impact of AltGen, with participants reporting significant improvements in document usability and comprehension. Furthermore, comparative analyses reveal that AltGen outperforms existing approaches in terms of accuracy, relevance, and scalability.
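Of the validation metrics named, cosine similarity is simple enough to sketch. In the snippet below, only the similarity formula is standard; `embed_image` and `embed_text` are hypothetical CLIP-style encoders, and the acceptance threshold is an assumption rather than AltGen's actual setting.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors, as used to score
    generated alt text against an image (or reference text) embedding."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative usage, assuming hypothetical CLIP-style encoders:
#   score = cosine_similarity(embed_image(img), embed_text(generated_alt))
#   accept = score >= 0.3   # threshold is an assumption, tuned per model
```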
Grounded Intuition of GPT-Vision's Abilities with Scientific Images
Hwang, Alyssa, Head, Andrew, Callison-Burch, Chris
GPT-Vision has impressed us on a range of vision-language tasks, but it comes with a familiar challenge for any new model: we have little idea of its capabilities and limitations. In our study, we formalize a process that many have already been using instinctively to develop a "grounded intuition" of this new model. Inspired by the recent movement away from benchmarking in favor of example-driven qualitative evaluation, we draw upon grounded theory and thematic analysis in social science and human-computer interaction to establish a rigorous framework for qualitative evaluation in natural language processing. We use our technique to examine alt text generation for scientific figures, finding that GPT-Vision is particularly sensitive to prompting, counterfactual text in images, and relative spatial relationships. Our method and analysis aim to help researchers ramp up their own grounded intuitions of new models while exposing how GPT-Vision can be applied to make information more accessible.
Concadia: Towards Image-Based Text Generation with a Purpose
Kreiss, Elisa, Fang, Fei, Goodman, Noah D., Potts, Christopher
Current deep learning models often achieve excellent results on benchmark image-to-text datasets but fail to generate texts that are useful in practice. We argue that to close this gap, it is vital to distinguish descriptions from captions based on their distinct communicative roles. Descriptions focus on visual features and are meant to replace an image (often to increase accessibility), whereas captions appear alongside an image to supply additional information. To motivate this distinction and help people put it into practice, we introduce the publicly available Wikipedia-based dataset Concadia consisting of 96,918 images with corresponding English-language descriptions, captions, and surrounding context. Using insights from Concadia, models trained on it, and a preregistered human-subjects experiment with human- and model-generated texts, we characterize the commonalities and differences between descriptions and captions. In addition, we show that, for generating both descriptions and captions, it is useful to augment image-to-text models with representations of the textual context in which the image appeared.
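The abstract's final claim, that image-to-text models benefit from representations of the textual context in which an image appeared, can be illustrated with a minimal sketch. Every architectural detail below (GRU decoder, fusion by concatenation, layer sizes) is an assumption for illustration, not the models trained in the paper.

```python
import torch
import torch.nn as nn

class ContextAugmentedCaptioner(nn.Module):
    """Minimal sketch: condition a text decoder on both image features and an
    embedding of the surrounding textual context. All details are assumptions."""
    def __init__(self, img_dim=512, ctx_dim=512, hid=512, vocab=30000):
        super().__init__()
        self.fuse = nn.Linear(img_dim + ctx_dim, hid)
        self.decoder = nn.GRU(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab)

    def forward(self, img_feat, ctx_feat, tok_emb):
        # img_feat: (B, img_dim), ctx_feat: (B, ctx_dim),
        # tok_emb: (B, T, hid) embedded target tokens (teacher forcing).
        h0 = torch.tanh(self.fuse(torch.cat([img_feat, ctx_feat], dim=-1)))
        out, _ = self.decoder(tok_emb, h0.unsqueeze(0))
        return self.out(out)  # (B, T, vocab) next-token logits
```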
Image Analysis 4.0 with new API endpoint and OCR model in preview
Enterprises and hobbyists alike have been using Azure Computer Vision's Image Analysis API to garner various insights from their images. These insights help power scenarios such as digital asset management, search engine optimization (SEO), image content moderation, and alt text for accessibility, among others. We are thrilled to announce the preview release of Computer Vision Image Analysis 4.0, which combines existing and new visual features such as read optical character recognition (OCR), captioning, image classification and tagging, object detection, people detection, and smart cropping into one API. One call is all it takes to run all these features on an image. The OCR feature integrates more deeply with the Computer Vision service and includes performance improvements optimized for image scenarios, making OCR easy to use for user interfaces and near-real-time experiences.
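For readers who want to try the preview, a single call might look like the sketch below. The endpoint shape, api-version string, and feature names reflect the preview announcement and may change; consult the current Azure documentation before relying on them.

```python
import requests

# Sketch of one Image Analysis 4.0 preview call; endpoint shape, api-version,
# and feature names are taken from the preview announcement and may change.
endpoint = "https://<your-resource>.cognitiveservices.azure.com"
url = f"{endpoint}/computervision/imageanalysis:analyze"

resp = requests.post(
    url,
    params={
        "api-version": "2023-02-01-preview",
        "features": "caption,read,tags,objects,people,smartCrops",
    },
    headers={
        "Ocp-Apim-Subscription-Key": "<your-key>",
        "Content-Type": "application/json",
    },
    json={"url": "https://example.com/sample.jpg"},
)
result = resp.json()
# The caption is one candidate for alt text; field name per preview docs.
print(result.get("captionResult", {}).get("text"))
```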
Writing for Search Engines: Optimize for Robots or People?
Google processes more than 8.5 billion searches every day. That's more than 100,000 searches per second, thousands of which could lead a user to a purchase. It's no wonder, then, that 60% of marketers list SEO as their number one inbound marketing priority. But generating organic traffic comes with challenges. Google has hundreds of billions of webpages in its index, competing for the top spots on search result pages.